Rethinking the Inception Architecture for Computer Vision
11. Conclusion
The combination of lower parameter count and additional regularization with batch-normalized auxiliary classifiers and label-smoothing allows for training high quality networks on relatively modest sized training sets.
The final sentence mentions Label Smoothing.
7. Model Regularization via Label Smoothing
k is the label index, k ∈ {1, ..., K}
p(k|x): the probability the model assigns to label k for training example x
The ground truth is q(k) = δ_k,y
i.e. 1 when k = y and 0 otherwise
The paper calls this a Dirac delta (for discrete labels this is the same as the Kronecker delta)
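As a quick illustration (my own sketch, not code from the paper): the paper defines p(k|x) as the softmax of the logits z_i, p(k|x) = exp(z_k) / Σ_i exp(z_i). The logits below are made up.

    import numpy as np

    def p_given_x(z):
        """p(k|x): softmax over the logits z, per the paper's definition."""
        e = np.exp(z - np.max(z))  # shift by max for numerical stability
        return e / e.sum()

    def q_ground_truth(y, num_classes):
        """q(k) = δ_k,y: one-hot distribution at the true label y."""
        q = np.zeros(num_classes)
        q[y] = 1.0
        return q

    # Hypothetical logits for a 5-class problem with true label y = 2
    z = np.array([1.0, 0.5, 3.0, 0.2, -1.0])
    print(p_given_x(z))
    print(q_ground_truth(2, 5))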
Consider a distribution over labels u(k), independent of the training example x, and a smoothing parameter ε.
The label distribution is replaced:
from q(k|x) = δ_k,y to q′(k|x) = (1 − ε)δ_k,y + εu(k)
i.e. every label k gains a mass of εu(k), while the weight on the true label drops to 1 − ε
u(k) is a fixed distribution, independent of x
In our experiments, we used the uniform distribution u(k) = 1/K
We refer to this change in ground-truth label distribution as label-smoothing regularization, or LSR.
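Putting the pieces together, a minimal sketch of LSR with the uniform u(k) = 1/K (my own illustration; the default eps = 0.1 matches the value the paper reports using in its ImageNet experiments):

    import numpy as np

    def smooth_labels(y, num_classes, eps=0.1):
        """q′(k|x) = (1 − eps)·δ_k,y + eps·u(k), with uniform u(k) = 1/K."""
        q = np.zeros(num_classes)
        q[y] = 1.0                                  # ground truth δ_k,y
        return (1.0 - eps) * q + eps / num_classes  # mix in uniform u(k)

    # Example: K = 5 classes, true label y = 2
    print(smooth_labels(2, 5))  # → [0.02 0.02 0.92 0.02 0.02]

Every class receives ε/K = 0.02 of probability mass, and the true class keeps (1 − ε) + ε/K = 0.92, so the target is no longer a hard one-hot vector.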